# EgoMemory: Memory-Augmented Personalized Retrieval for Long-Context Egocentric Video

## Overview

This repository contains the code and data samples for our paper, **"EgoMemory: Memory-Augmented Personalized Retrieval for Long-Context Egocentric Video,"** submitted anonymously to NeurIPS 2025.


## Repository Structure

```
├── EgoMemory_constructer.py        # Script to generate EgoMemory benchmark annotations
├── sample_data/                    # Contains example benchmark data and video clips
│   ├── EgoMemory_samples.csv       # Sample benchmark data for evaluation
│   ├── EgoMemory_visualize_sample.csv # Sample visualization for qualitative inspection
│   ├── ad851441-1f15-467e-83d8-48c764e220a8_26880_26909.jpg # Example reference frame
│   └── sample_top5_retrieved_videos/ # Directory for top-5 retrieved videos
├── EgoRetrieval/
│   ├── egomomory_retrieval.py      # Script for running EgoRetriever and evaluation
│   ├── model/                      # Directory containing models used for retrieval
│   └── prompts.py                  # Script containing retrieval prompts
├── data/                           # Directory for Ego4D dataset (user provided)
```

## Installation

### Data Preparation

Download the Ego4D dataset and annotations with the Ego4D CLI:

```bash
ego4d --output_directory="./data/ego4d_data/" --datasets full_scale annotations --metadata
```

## EgoMemory Benchmark Construction

Construct the EgoMemory benchmark with MLLM-generated annotations:

```bash
python EgoMemory_constructer.py --openai_engine gpt-4o-20241120 --dataset-path ./data/ego4d_data/annotations --output_pth ./egomemory
```

## Sample Visualization from EgoMemory Benchmark

To facilitate qualitative inspection of our retrieval system's performance, we provide `EgoMemory_visualize_sample.csv` in the `sample_data/` directory. This file contains structured entries for a subset of queries, each paired with its ground-truth target clip and the top-5 retrieved video clips. The retrieved clips themselves are stored in `sample_data/sample_top5_retrieved_videos/`.

An excerpt from a representative sample is shown below. In this excerpt, each `Recall@k` cell links to the k-th of the top-5 retrieved clips:

| `person_id` | `query` | `relevant_personal_memory` | `reference_frame`                                                                                           | `reference_video_object_attribute` | `target_video_description` | `target_clip` | `Recall@1` | `Recall@2` | `Recall@3` | `Recall@4` | `Recall@5` |
|-------------|---------|-----------------------------|-------------------------------------------------------------------------------------------------------------|-------------------------------------|------------------------------|----------------|-------------|-------------|-------------|-------------|-------------|
| 137 | *Who did I interact with when I played with the dog in the living room?* | *(semantic attributes related to object category, color, texture, etc.)* | ![ad851441-1f15...20a8_26880_26909.jpg](./sample_data/ad851441-1f15-467e-83d8-48c764e220a8_26880_26909.jpg) | `{ "major_category": "animal", "subcategory": "dog", "color": "black and white", ... }` | *A man interacts with a black and white border collie on a brown sofa.* | [93231c7e..._30420_30692](./sample_data/sample_top5_retrieved_videos/93231c7e-1cf4-4a20-b1f8-9cc9428915b2_30420_30692.mp4) | [93231c7e..._30420_30692](./sample_data/sample_top5_retrieved_videos/93231c7e-1cf4-4a20-b1f8-9cc9428915b2_30420_30692.mp4) | [6c2849cb..._16837_16944](./sample_data/sample_top5_retrieved_videos/6c2849cb-d6bb-432e-b4ae-8b8c4837ad8b_16837_16944.mp4) | [6c2849cb..._14282_14407](./sample_data/sample_top5_retrieved_videos/6c2849cb-d6bb-432e-b4ae-8b8c4837ad8b_14282_14407.mp4) | [2c2bda8d..._11465_11539](./sample_data/sample_top5_retrieved_videos/2c2bda8d-69a3-4a90-9ad6-f6715bc99f39_11465_11539.mp4) | [93231c7e..._32850_33060](./sample_data/sample_top5_retrieved_videos/93231c7e-1cf4-4a20-b1f8-9cc9428915b2_32850_33060.mp4) |

This example illustrates our benchmark's focus on **fine-grained multimodal retrieval**, where models are tasked with matching a user's natural language query against egocentric memories, guided by semantically rich visual and contextual cues.
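
As a minimal sketch of how the sample row above could be scored, the snippet below assumes that the `Recall@k` columns hold the k-th ranked retrieved clip and that a hit means an exact name match with `target_clip`. The helper `recall_at_k` is hypothetical and is not part of this repository; the evaluation script may compute the metric differently.

```python
def recall_at_k(target_clip, ranked_clips, k):
    """Return 1.0 if target_clip appears among the first k ranked clips, else 0.0."""
    return 1.0 if target_clip in ranked_clips[:k] else 0.0


# Clip names copied from the links in the sample row above.
target = "93231c7e-1cf4-4a20-b1f8-9cc9428915b2_30420_30692"
ranked = [
    "93231c7e-1cf4-4a20-b1f8-9cc9428915b2_30420_30692",
    "6c2849cb-d6bb-432e-b4ae-8b8c4837ad8b_16837_16944",
    "6c2849cb-d6bb-432e-b4ae-8b8c4837ad8b_14282_14407",
    "2c2bda8d-69a3-4a90-9ad6-f6715bc99f39_11465_11539",
    "93231c7e-1cf4-4a20-b1f8-9cc9428915b2_32850_33060",
]

# Score the single query at every cutoff from 1 to 5.
scores = {k: recall_at_k(target, ranked, k) for k in range(1, 6)}
print(scores)
```

Because the target clip is ranked first in this row, the query counts as a hit at every cutoff under these assumptions.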

We also provide 150 benchmark samples (without videos) in [`EgoMemory_samples.csv`](./sample_data/EgoMemory_samples.csv), which serve as the basis for both **quantitative evaluation** and **model comparison** across various retrieval architectures.
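
For a quick look at the sample file, a sketch along the following lines may help. It uses only Python's standard `csv` module. The helper `load_samples` is hypothetical, and the two column names in the stand-in data are borrowed from the visualization table above; the actual header of `EgoMemory_samples.csv` may differ.

```python
import csv
import io


def load_samples(file_obj):
    """Read every row of an open CSV file into a list of dicts keyed by header."""
    return list(csv.DictReader(file_obj))


# Stand-in for:
#   open("sample_data/EgoMemory_samples.csv", newline="", encoding="utf-8")
# The columns below are assumed for illustration only.
demo = io.StringIO("person_id,query\n137,Who did I interact with?\n")

rows = load_samples(demo)
print(len(rows), list(rows[0].keys()))
```

Replacing `demo` with the opened sample file would print the number of rows and the real column names.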


## EgoRetriever: Reflective Chain-of-Thought for Personal Egocentric Video Retrieval

Evaluate our EgoRetriever method:

```bash
python EgoRetrieval/egomomory_retrieval.py \
  --openai_engine gpt-4o-20241120 \
  --gpt_prompt prompts.mllm_CoT_target_video_description \
  --model languagebind egovlpv2 \
  --modalities text text
```

### Additional Models Evaluation

Example commands for evaluating other models:

```bash
# CLIP
python EgoRetrieval/egomomory_retrieval.py --evaluation global --model clip --modalities visual-text --text instruction

# BLIP
python EgoRetrieval/egomomory_retrieval.py --evaluation global --model blip --modalities visual-text --text instruction

# LanguageBind
python EgoRetrieval/egomomory_retrieval.py --evaluation global --model languagebind --modalities visual-text --text instruction

# BLIP_CoVR
python EgoRetrieval/egomomory_retrieval.py --evaluation global --model blip --modalities visual-text --text instruction --fusion crossattn --finetuned
```
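
To run all four baselines in one go, the commands above can be assembled programmatically. This is only a convenience sketch: `build_command` and `BASELINES` are hypothetical names, and the flags are copied verbatim from the commands above rather than checked against the script's argument parser.

```python
import shlex

SCRIPT = "EgoRetrieval/egomomory_retrieval.py"

# (label, value for --model, extra flags), copied from the commands above.
BASELINES = [
    ("CLIP", "clip", []),
    ("BLIP", "blip", []),
    ("LanguageBind", "languagebind", []),
    ("BLIP_CoVR", "blip", ["--fusion", "crossattn", "--finetuned"]),
]


def build_command(model, extra_flags):
    """Return the argument list for one baseline evaluation run."""
    return [
        "python", SCRIPT,
        "--evaluation", "global",
        "--model", model,
        "--modalities", "visual-text",
        "--text", "instruction",
        *extra_flags,
    ]


for label, model, extra in BASELINES:
    cmd = build_command(model, extra)
    print(f"# {label}")
    print(shlex.join(cmd))
    # To execute instead of printing:
    #   import subprocess; subprocess.run(cmd, check=True)
```

The loop prints each command so it can be inspected before anything is executed.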




